Far field automatic speech recognition pre-processing

ABSTRACT

Systems and techniques for automatic speech recognition pre-processing are described herein. First, a plurality of audio channels may be obtained. Then, reverberations may be removed from the audio channels. The plurality of audio channels may be partitioned into beams after reverberations are removed. A partition corresponding to a beam in the beams may be selected based on a noise level. An audio signal may be filtered from the selected partition. The filtered audio signal may be provided to an external entity via an output interface of the pre-processing pipeline.

CLAIM OF PRIORITY

This patent application claims the benefit of priority, under 35 U.S.C. § 119, to U.S. Provisional Application Ser. No. 62/350,507, titled “FAR FIELD AUTOMATIC SPEECH RECOGNITION” and filed on Jun. 15, 2016, the entirety of which is hereby incorporated by reference herein.

TECHNICAL FIELD

Embodiments described herein generally relate to automatic speech recognition (ASR) and more specifically to improving ASR pre-processing.

BACKGROUND

ASR involves a machine-based collection of techniques to understand human languages. ASR is interdisciplinary, often involving microphone, analog-to-digital conversion, frequency processing, database, and artificial intelligence technologies to convert the spoken word into textual or machine readable representations of not only what was said (e.g., a transcript) but also what was meant (e.g., semantic understanding) by a human speaker. Far field ASR involves techniques to decrease a word error rate (WER) in utterances made at a greater distance from a microphone, or microphone array, than traditionally accounted for in ASR processing pipelines. Such distance often decreases the signal-to-noise ratio (SNR) and thus increases WER in traditional ASR systems. As used herein, far field ASR involves distances of more than half a meter from the microphone.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.

FIG. 1 is an example of a smart home gateway housing, according to an embodiment.

FIG. 2 is a block diagram of an example of a system for far field automatic speech recognition pre-processing, according to an embodiment.

FIG. 3 illustrates phase-based beam forming (PBF) directivity patterns, according to an embodiment.

FIG. 4 is a plot of far field ASR WER improvements for different types of noise, according to an embodiment.

FIG. 5 illustrates an example of a method for automatic speech recognition pre-processing, according to an embodiment.

FIG. 6 is a block diagram illustrating an example of a machine upon which one or more embodiments may be implemented.

DETAILED DESCRIPTION

Embodiments and examples herein generally describe a number of systems, devices, and techniques for automatic speech recognition pre-processing. It is understood, however, that the systems, devices, and techniques are examples illustrating the underlying concepts.

FIG. 1 is an example of a smart home gateway 105, according to an embodiment. As illustrated, the circles atop the housing are lumens 110 behind which are housed microphones (as illustrated, there are eight microphones). The dashed lines illustrate microphones in a linear arrangement 115 as well as in a circular arrangement 120. Many of the examples described herein operate with these dual arrangements (e.g., linear 115 and circular 120) with respect to a device 105. Although the device 105 here takes the form of the smart home gateway, other configurations are contemplated, such as in a desktop or laptop computer configuration, a refrigerator or other appliance, etc.

A factor contributing to the far field performance drop for ASR may include speech signal quality degradation due to some or all of reverberations, echo, noise, or amplitude loss. For example, from several experiments, four issues related to far field ASR were found: reverberation; echo; noise; and amplitude losses. The influence of one or all of these factors may be mitigated by intelligently ordering a variety of processing techniques. For example, reverberation (e.g., reverb) reduction enables use of beam-formers and noise reduction (NR) techniques that were not designed to work in reverberant conditions. In another example, acoustic echo cancelation (AEC) reduces echo generated by internal loudspeakers. Also, for example, beam-formers and additional post-filtering modules reduce noise level. An automatic gain control (AGC) device counteracts amplitude losses. Overall, the unique combination and order of the processing used in the described far field pre-processing pipeline enables accurate far field ASR.

An example of just such a pipeline in the device 105 may include a sampler 125, a de-reverberator 127, a beam-former processor 130, a stream selector 135, a filter 140, and a controller 145. Each of these components is implemented in electronic hardware, such as that described below (e.g., circuits).

The sampler 125 is arranged to obtain a plurality of audio channels. Thus, the sampler 125 may be a part of a microphone array, have a tap on microphone output, or have the plurality of audio channels delivered via another component of the device 105. In an example, an audio channel is audio from a single microphone. In an example, an audio channel is audio from a plurality of microphones wherein the signal from these microphones is correlated based on a physical arrangement of the microphones, such as a spacing, linear or circular relationship, etc. In an example, after obtaining the plurality of audio channels by the sampler 125, the de-reverberator 127 removes reverberation prior to the beam-former processor partitioning the plurality of audio channels into beams. Removing the reverberation may be accomplished using a variety of techniques, such as short-time Fourier transform (STFT) domain inverse filtering methods, non-negative room impulse response (RIR) modeling, statistical RIR modeling, or nonlinear mapping (e.g., denoising auto-encoder using a deep neural network or bidirectional long short-term memory (BLSTM) recurrent neural network). After obtaining the plurality of audio channels by the sampler 125, and after applying de-reverberation to the audio channels by the de-reverberator 127, an output may be directed to, or retrieved by, the beam-former processor 130.
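The reverberation-removal step can be illustrated with a short STFT-domain sketch. The example below is a minimal single-channel suppressor assuming an exponential-decay model of late reverberation, a crude stand-in for the STFT inverse filtering and RIR-modeling options listed above; the function name, RT60 value, and frame parameters are illustrative assumptions, not the de-reverberator 127 itself.

    import numpy as np
    from scipy.signal import stft, istft

    def dereverberate(x, fs, rt60=0.5, delay_frames=4, floor=0.1, nperseg=512):
        """Suppress late reverberation by STFT-domain spectral subtraction.

        Assumes late-reverb power in each band is roughly the signal power
        `delay_frames` hops earlier, attenuated by the decay implied by RT60.
        """
        hop = nperseg // 2
        f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
        power = np.abs(X) ** 2

        # Power attenuation over `delay_frames` hops for the assumed RT60.
        decay = 10.0 ** (-6.0 * delay_frames * hop / (fs * rt60))

        # Late-reverberation power estimate: delayed, attenuated signal power.
        late = np.zeros_like(power)
        late[:, delay_frames:] = decay * power[:, :-delay_frames]

        # Spectral subtraction with a gain floor to limit musical noise.
        gain = np.maximum(1.0 - late / (power + 1e-12), floor)
        _, x_clean = istft(gain * X, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
        return x_clean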

The beam-former processor 130 is arranged to partition the plurality of audio channels into beams. Here, beams refer to energy received from a specific direction. Generally, given a single stationary microphone, the frequency and amplitude of sound energy may be determined, but there is not enough information to also determine a direction. The addition of a second microphone (e.g., analogous to two human ears) provides two signals that may be correlated in frequency and amplitude but which may vary in time. With a known and fixed relationship between these microphones, the variations of the audio signal in time may provide a relative direction of the energy. This may then be considered the beam. Thus, in an example, to partition the plurality of audio channels into beams, the beam-former processor 130 is arranged to obtain (e.g., receive or retrieve) the plurality of audio channels, partition the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels, and provide each partition to the phase-based beam-former. In this example, the audio channel partitioning allows the beam-former processor 130 or the phase-based beam-former to ascertain the time variance (e.g., a measure of how in-phase the signals are) with a known physical arrangement of microphones. As explained earlier, this provides the information to ascertain the direction from which the energy (e.g., sound) came. Beamforming provides another level of control in finding a clean signal from which to process ASR.
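A compact sketch of one phase-based beam-former instance is given below for a single two-microphone partition. It keeps time-frequency bins whose inter-microphone phase difference matches the steering direction and attenuates the rest; the masking formulation, the 343 m/s sound speed, and the soft-rejection gain are assumptions made for illustration rather than the patent's exact PBF design.

    import numpy as np
    from scipy.signal import stft, istft

    SPEED_OF_SOUND = 343.0  # m/s, assumed

    def phase_based_beam(pair, fs, mic_spacing, steer_deg, width_deg=45.0, nperseg=512):
        """Phase-difference beam-former for one two-channel partition.

        `pair` is a (2, n_samples) array from one microphone pair.  Bins whose
        phase difference matches the expected difference for `steer_deg` are
        passed; others are attenuated.
        """
        f, _, X = stft(pair, fs=fs, nperseg=nperseg)

        # Expected inter-microphone delay for the steering direction.
        tdoa = mic_spacing * np.cos(np.deg2rad(steer_deg)) / SPEED_OF_SOUND
        expected = 2.0 * np.pi * f * tdoa

        # Observed phase difference between the two channels.
        observed = np.angle(X[0] * np.conj(X[1]))

        # Phase tolerance corresponding to the desired beam width.
        tol = 2.0 * np.pi * f * mic_spacing * np.sin(np.deg2rad(width_deg / 2)) / SPEED_OF_SOUND

        # Wrap the error to [-pi, pi] and build a soft binary mask.
        err = np.abs(np.angle(np.exp(1j * (observed - expected[:, None]))))
        mask = np.where(err <= tol[:, None] + 1e-3, 1.0, 0.2)

        _, y = istft(mask * X[0], fs=fs, nperseg=nperseg)
        return y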

The stream selector 135 is arranged to select a partition corresponding to a beam in the beams based on a noise level. In an example, to select the partition corresponding to the beam based on the noise level, the stream selector 135 is arranged to compare noise levels between the beams and select the beam based on having the lowest noise levels determined from the comparison. In an example, the stream selector 135 uses a phrase quality scorer of the stream selector to compare the noise levels across the beams. In an example, an SNR meter of the stream selector provides a noise level for each beam. The stream selector 135 thus discriminates amongst a variety of possible input sources to provide (e.g., to send or make available) a better signal to downstream processors.
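The selection logic can be sketched as a simple SNR comparison across the beam-formed streams. The percentile-based noise-floor and speech-level estimates below are assumed placeholders for the phrase quality scorer and SNR meter, which the description does not specify at this level of detail.

    import numpy as np

    def estimate_snr(x, frame=512, hop=256):
        """Crude SNR score: loud-frame level minus quiet-frame level, in dB."""
        n = max(1, 1 + (len(x) - frame) // hop)
        frames = np.stack([x[i * hop:i * hop + frame] for i in range(n)])
        energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        noise_floor = np.percentile(energy_db, 10)    # quietest frames ~ noise
        speech_level = np.percentile(energy_db, 90)   # loudest frames ~ speech
        return speech_level - noise_floor

    def select_stream(beams):
        """Return the index of the beam-formed stream with the best SNR score."""
        return int(np.argmax([estimate_snr(b) for b in beams]))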

The filter 140 is arranged to reduce the level of noise in an audio signal from the selected partition. In an example, to reduce the level of noise in the audio signal from the selected partition, the filter 140 applies noise reduction to the audio signal. In an example, to enhance the speech signal from the selected partition, the filter applies a spectral profile matching (SPM) to the audio signal. In an example, the spectral profile matching is applied after noise reduction is applied to the audio signal.

In an example, to boost the speech signal in the selected partition, the filter 140 applies an automated gain control to the audio signal. In an example, the automated gain control is applied after a spectral profile matching is applied to the audio signal.
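The post-filter ordering described above (noise reduction, then spectral profile matching, then automated gain control) can be sketched as a single chain. The spectral-subtraction NR, the long-term-spectrum equalizer standing in for SPM, and the RMS-based AGC below are simplified assumptions; noise_psd and target_profile are hypothetical inputs (a noise estimate and an ASR-corpus spectral target).

    import numpy as np
    from scipy.signal import stft, istft

    def post_filter(x, fs, noise_psd, target_profile=None, target_rms=0.05, nperseg=512):
        """Apply NR, then spectral profile matching, then AGC, in that order."""
        f, _, X = stft(x, fs=fs, nperseg=nperseg)
        power = np.abs(X) ** 2

        # 1. Noise reduction: spectral subtraction against the reference noise PSD.
        gain_nr = np.sqrt(np.maximum(1.0 - noise_psd[:, None] / (power + 1e-12), 0.1))
        X = gain_nr * X

        # 2. Spectral profile matching: per-band equalization toward the target
        #    long-term magnitude spectrum (skipped when no profile is supplied).
        if target_profile is not None:
            current = np.mean(np.abs(X), axis=1) + 1e-12
            X = (target_profile / current)[:, None] * X

        _, y = istft(X, fs=fs, nperseg=nperseg)

        # 3. AGC: normalize to a target RMS level to counteract amplitude loss.
        rms = np.sqrt(np.mean(y ** 2)) + 1e-12
        return y * (target_rms / rms)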

In an example, the pipeline may optionally include a second filter (not illustrated) to perform acoustic echo cancellation to the plurality of audio channels. In an example, the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams. In an example, the second filter is part of the de-reverberator 127.

The controller 145 is arranged to provide the audio signal to an external entity via an output interface of the pre-processing pipeline. Thus, the controller 145 interfaces with downstream components to further process the semantic content in an ASR system.

FIG. 2 is a block diagram of an example of a system 200 for far field automatic speech recognition pre-processing, according to an embodiment. The system 200 includes additional examples of the components discussed above. The components of the system 200 are implemented in electronic hardware, such as that described above or below (e.g., circuits).

The system 200 includes a pipeline 205 for real-time far field ASR. By ordering the components of the system 200 as illustrated, ASR techniques that previously have been discarded in far field ASR due to reverberations may be reintroduced, such as:

-   the phase-based beam-former (PBF); and
-   the Spectral Profile Matching (SPM).

The far field pre-processing pipeline 205 may be composed of six processing blocks: a de-reverberator 210, an optional AEC 215, a beam-former 220, a stream selector 230, a post-filtering block 245, and a content analysis block 265. In an example, the order of the far field pre-processing blocks is important (i.e., they must be in the order presented in FIG. 2). The far field pre-processing pipeline 205 may operate on a multichannel input. The multichannel input may be obtained from a microphone array containing at least two microphones. In an example, there is no upper limit for the number of microphones that may be used. In an example, there are no limitations for the microphone array geometry (e.g., linear, circular, etc.). In an example, the number of microphones is an even number (e.g., the modulus of the number of microphones and two is zero).
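A minimal driver illustrating this ordering is sketched below. It reuses the illustrative helpers sketched earlier in this document (dereverberate, phase_based_beam, select_stream, post_filter), treats the optional AEC stage as a pass-through, and takes hypothetical pair indices, steering angles, microphone spacing, and noise PSD as inputs; it is an assumed composition of the blocks, not the pipeline 205 itself.

    import numpy as np

    def far_field_preprocess(channels, fs, pairs, steer_angles, mic_spacing, noise_psd):
        """Run the pre-processing blocks in the order described above.

        channels:     (n_mics, n_samples) array of time-aligned microphone signals
        pairs:        list of (i, j) microphone-index tuples (one per beam)
        steer_angles: steering direction in degrees for each pair
        """
        # 1. De-reverberation on every channel first.
        channels = np.stack([dereverberate(c, fs) for c in channels])

        # 2. Optional multichannel AEC against a loopback reference (omitted here).

        # 3. Beam-forming: one phase-based beam per microphone pair.
        beams = [phase_based_beam(channels[list(p)], fs, mic_spacing, angle)
                 for p, angle in zip(pairs, steer_angles)]

        # 4. Stream selection by noise level / SNR.
        best = select_stream(beams)

        # 5. Post-filtering (NR, SPM, AGC) on the chosen stream only.
        return post_filter(beams[best], fs, noise_psd)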

In the de-reverb block 210, reverberations are removed from the multichannel input. Parameters of the de-reverberation block 210 may be adjusted to balance computational complexity and performance. Techniques to remove reverberation may include pre-configured room impulse models, or others, as described above.

In an example, the far field pre-processing pipeline 205 may be used with a device containing internal loudspeakers. In this example, acoustical leakage from the loudspeakers to the microphones may be reduced by the optional multichannel AEC block 215. In an example, the AEC block 215 includes one or more of the following properties:

-   it is located after the de-reverb block 210, thus the AEC block 215 analyzes signals that are not affected by the room reverb;
-   it creates a cancelling filter using the multichannel reference signal, which improves AEC performance due to additional information that can be extracted from the different channels; or
-   it is positioned before the beam-former block 220, not after the beam-former block 220.

After the AEC block 215, the multichannel stream has had the room reverb and loudspeaker echo removed (to the extent practical). Thus, the beam-former block 220 may use phase-based beam formers (PBFs) 225, or other beam forming techniques such as minimum variance distortionless response (MVDR) beam formers, to process the multichannel stream. Generally, for far field ASR, PBFs 225 cannot be used without removing the echo and reverb because the PBF 225 generally requires direct sound in the microphone signals. In reverberant conditions this requirement is not met because reflections (e.g., non-direct signals) would also be captured. Consequently, the precise detection of user position—an important feature in PBF 225 processing—will not be possible. This issue worsens for distances between the user and the device greater than two meters. However, in the illustrated arrangement, nearly all reflections (e.g., most of their energy) are removed before the PBF 225 stage. Thus, it is possible to use PBFs 225 effectively.
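User-position detection for a microphone pair, once reverberation and echo have been suppressed, can be illustrated with a generalized cross-correlation phase transform (GCC-PHAT) delay estimate. The patent does not name a specific localizer, so this is an assumed example; the direction then follows from the estimated delay and the known microphone spacing.

    import numpy as np

    def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau):
        """Estimate the time difference of arrival between two microphones.

        Uses GCC-PHAT; with microphone spacing d, the arrival angle relative to
        the pair's broadside is approximately arcsin(343 * tdoa / d).
        """
        n = len(sig_a) + len(sig_b)
        A = np.fft.rfft(sig_a, n=n)
        B = np.fft.rfft(sig_b, n=n)
        cross = A * np.conj(B)
        cross /= np.abs(cross) + 1e-12                 # PHAT weighting
        cc = np.fft.irfft(cross, n=n)

        max_shift = max(1, int(fs * max_tau))
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / fs                              # seconds; sign gives the side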

The PBFs 225 use two signals coming from a microphone pair. Therefore, for microphone arrays with more than two microphones, multiple instances of PBFs 225 may be used (e.g., one PBF 225 for each exclusive pair). Each PBF 225 instance may be steered toward different directions (e.g., relative to the device). FIG. 3 illustrates directivity patterns of four PBF 225 instances when used together with the microphone board described herein. In FIG. 3, signals from eight microphones, two blank, two diagonally striped, two diagonally cross-hatched, and two vertically cross-hatched (grouped pairwise in the center with the center-most microphones in a group), are grouped in four steering pairs of covered area [i.e., the groups of 1) dashed with two dots, 2) dashed with one dot, 3) dashed, and 4) dotted]. As illustrated, sounds from each area pair are fed into the separate PBF 225 instances. As a result, the PBF-processed signals point towards four different directions with a 45-degree beam width each. Since the PBF 225 processing is bi-directional—e.g., the same beam pattern for front- and back-facing directions relative to a microphone pair, these directions being perpendicular to a line drawn between the two microphones—the combined solution provides 360-degree coverage (e.g., the circular long and short dashed lines in FIG. 3).

In an example, owing to the four directional streams, user localization is possible. Thus, the stream selector 230 may assess each directional stream against selected localization criteria, such as highest Signal-to-Noise Ratio (SNR)—e.g., calculated using the Signal Level Measurement (SLM) 270 or highest score of the Voice Activity Detector (VAD) 275 in the content analysis block 265—and select a stream more conducive to ASR. The stream selector 230 may include one or more of a phrase quality scorer 235 or SNR meter 240 to provide localization criteria scores on the streams. Based on the localization criteria, only one of the PBF-processed streams (e.g., the stream with the highest SNR) may be selected by the stream selector 230 for further processing. Because the selected stream (e.g., for further processing) is beam-formed, the influence of noise coming from all directions (e.g., areas not covered by the formed beam) is reduced and the user's speech is better exposed (e.g., more clear or less obstructed by that noise). This improves SNR, leading to better far field ASR performance.

In an example, one or more post-filtering operations may be applied to the streams by the post-filtering block 245. Example post-filtering operations may include:

-   NR 250—used to reduce remaining noise;
-   Spectral Profile Matching (SPM) 255—used to equalize the speech signal to match the frequency response of the ASR's training corpora; or
-   AGC 260—used to normalize signal level.

In an example, the NR 250 may accept a reference stream containing PBF-processed signals that were classified by the stream selector block 230 as noisy, at least compared to the other available streams (e.g., beams pointing in a direction that is different from that of the user). In an example, noisy streams may be used to calculate a robust estimation of the noise floor that the NR 250 will remove.
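One simple realization of that estimate is sketched below as a median over frames and over the rejected beams; the result can serve as the per-band noise estimate consumed by the NR stage. The function name and framing parameters are assumptions for illustration only.

    import numpy as np
    from scipy.signal import stft

    def noise_floor_from_rejected(rejected_beams, fs, nperseg=512):
        """Estimate a per-band noise floor from the streams the selector rejected."""
        psds = []
        for beam in rejected_beams:
            _, _, X = stft(beam, fs=fs, nperseg=nperseg)
            psds.append(np.median(np.abs(X) ** 2, axis=1))  # robust per-band PSD
        return np.median(np.stack(psds), axis=0)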

In an example, the AGC block 260 uses a reference signal. In an example, the reference signal may be a typical loopback signal from the playback path.

Some experiments have shown that the SPM block 255 helps some ASR engines and the NR 250 helps other (e.g., different) ASR engines. Thus, in an example, the inclusion of one or more of these components is optional, providing further customization for performance, effectiveness, power use, design complexity, etc.

Output of the far field pre-processing pipeline may be provided to a client 280 that may implement an ASR engine 285. In an example, however, the client 280 may implement a wake on voice (WoV) engine 290 or use the output in a VoIP communication channel 295. FIG. 4 illustrates far field ASR WER improvements for different noise types—LiRo: living room; SiSp: side speaker; Public: public place; and Work: work place—obtained using the far field pre-processing pipeline 205; unprocessed signals are the dashed line (on top) and processed signals are the short dash-double dotted line (on bottom).

All of the blocks illustrated in FIG. 2 were implemented and evaluated to find their influence on far field ASR performance. It was shown that every element of the pipeline introduces improvement. The improvement was illustrated by the lower WERs obtained from multiple ASR engines in far field scenarios. Further, blocks were combined offline to simulate the far field pre-processing pipeline. The simulation demonstrated better ASR performance compared to using the blocks individually. The far field pre-processing pipeline 205 was then ported to a real-time audio stack and used in the mock-up of a smart home gateway device (e.g., intelligent loudspeaker) illustrated in FIG. 1. Real-time demonstrations with the mock-up exhibited the simulated far field ASR improvements. Although the techniques discussed above are useful in far field applications, they may be applied in near field ASR, or other ASR applications (e.g., distances), as well.

FIG. 5 illustrates an example of a method 500 for automatic speech recognition pre-processing, according to an embodiment. The operations of the method 500 are implemented in electronic hardware, such as that described above or below (e.g., circuits).

At operation 505, a plurality of audio channels is obtained. In an example, obtaining the plurality of audio channels includes removing reverberation prior to a beam-former processor partitioning the plurality of audio channels into beams.

At operation 510, the plurality of audio channels are partitioned into beams. In an example, partitioning the plurality of audio channels into beams includes receiving the plurality of audio channels at a beam-former processor, partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels, and providing each partition to a phase-based beam-former.

At operation 515, a partition corresponding to a beam in the beams is selected based on a noise level. In an example, selecting the partition corresponding to the beam based on the noise level includes comparing noise levels between the beams and selecting the beam based on having the lowest noise levels determined from the comparison. In an example, a phrase quality scorer of a stream selector performing the partition selection compares the noise levels between the beams. In an example, a signal-to-noise (SNR) meter of the stream selector provides a noise level for each beam.

At operation 520, an audio signal is filtered from the selected partition. In an example, the filtering includes applying noise reduction to the audio signal. In an example, the filtering includes applying a spectral profile matching (SPM) to the audio signal. In an example, the SPM is applied after noise reduction is applied to the audio signal.

In an example, the filtering includes applying an automated gain control to the audio signal. In an example, the automated gain control is applied after a spectral profile matching is applied to the audio signal.

In an example, the method 500 may be extended by optionally performing acoustic echo cancellation to the plurality of audio channels. In an example, the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams.

At operation 525, the filtered audio signal is provided to an external entity via an output interface of the pre-processing pipeline.

FIG. 6 illustrates a block diagram of an example machine 600 upon which any one or more of the techniques (e.g., methodologies) discussed herein may perform. In alternative embodiments, the machine 600 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 600 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the machine 600 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. The machine 600 may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computer cluster configurations.

Examples, as described herein, may include, or may operate by, logic or a number of components, or mechanisms. Circuitry is a collection of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic, etc.). Circuitry membership may be flexible over time and underlying hardware variability. Circuitries include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuitry may be immutably designed to carry out a specific operation (e.g., hardwired). In an example, the hardware of the circuitry may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a computer readable medium physically modified (e.g., magnetically, electrically, moveable placement of invariant massed particles, etc.) to encode instructions of the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed, for example, from an insulator to a conductor or vice versa. The instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to carry out portions of the specific operation when in operation. Accordingly, the computer readable medium is communicatively coupled to the other components of the circuitry when the device is operating. In an example, any of the physical components may be used in more than one member of more than one circuitry. For example, under operation, execution units may be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry at a different time.

Machine (e.g., computer system) 600 may include a hardware processor 602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a hardware processor core, or any combination thereof), a main memory 604 and a static memory 606, some or all of which may communicate with each other via an interlink (e.g., bus) 608. The machine 600 may further include a display unit 610, an alphanumeric input device 612 (e.g., a keyboard), and a user interface (UI) navigation device 614 (e.g., a mouse). In an example, the display unit 610, input device 612 and UI navigation device 614 may be a touch screen display. The machine 600 may additionally include a storage device (e.g., drive unit) 616, a signal generation device 618 (e.g., a speaker), a network interface device 620, and one or more sensors 621, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor. The machine 600 may include an output controller 628, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate or control one or more peripheral devices (e.g., a printer, card reader, etc.).

The storage device 616 may include a machine readable medium 622 on which is stored one or more sets of data structures or instructions 624 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein. The instructions 624 may also reside, completely or at least partially, within the main memory 604, within static memory 606, or within the hardware processor 602 during execution thereof by the machine 600. In an example, one or any combination of the hardware processor 602, the main memory 604, the static memory 606, or the storage device 616 may constitute machine readable media.

While the machine readable medium 622 is illustrated as a single medium, the term “machine readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) configured to store the one or more instructions 624.

The term “machine readable medium” may include any medium that is capable of storing, encoding, or carrying instructions for execution by the machine 600 and that cause the machine 600 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding or carrying data structures used by or associated with such instructions. Non-limiting machine readable medium examples may include solid-state memories, and optical and magnetic media. In an example, a massed machine readable medium comprises a machine readable medium with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals. Specific examples of massed machine readable media may include: non-volatile memory, such as semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

The instructions 624 may further be transmitted or received over a communications network 626 using a transmission medium via the network interface device 620 utilizing any one of a number of transfer protocols (e.g., frame relay, internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others. In an example, the network interface device 620 may include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network 626. In an example, the network interface device 620 may include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine 600, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.

ADDITIONAL NOTES & EXAMPLES

Example 1 is a system for automatic speech recognition pre-processing, the system comprising: a sampler to obtain a plurality of audio channels; a de-reverberator to remove reverberations from the plurality of audio channels; a beam-former processor to partition the plurality of audio channels into beams after reverberations are removed; a stream selector to select a partition corresponding to a beam in the beams based on a noise level; a filter to reduce a noise level in a speech signal from the selected partition; and a controller to provide the audio signal to an external entity via an output interface of the pre-processing pipeline.

In Example 2, the subject matter of Example 1 optionally includes an echo cancelation block disposed between the de-reverberator and the beam-former processor to cancel echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.

In Example 3, the subject matter of any one or more of Examples 1-2 optionally include wherein, to partition the plurality of audio channels into beams, the beam-former processor is to: receive the plurality of audio channels; partition the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and provide each partition to a phase-based beam-former.

In Example 4, the subject matter of any one or more of Examples 1-3 optionally include wherein, to select the partition corresponding to the beam based on the noise level, the stream selector is to: compare speech levels between the beams; and select the beam based on having the highest speech levels determined from the comparison.

In Example 5, the subject matter of any one or more of Examples 1-4 optionally include wherein, to select the partition corresponding to the beam based on the noise level, the stream selector is to: compare noise levels between the beams; and select the beam based on having the lowest noise levels determined from the comparison.

In Example 6, the subject matter of Example 5 optionally includes wherein the stream selector uses a phrase quality scorer of the stream selector to compare the noise levels between the beams.

In Example 7, the subject matter of Example 6 optionally includes wherein a signal-to-noise (SNR) meter of the stream selector provides a noise level for each beam.

In Example 8, the subject matter of any one or more of Examples 1-7 optionally include wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies noise reduction to the audio signal.

In Example 9, the subject matter of any one or more of Examples 1-8 optionally include wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies a spectral profile matching (SPM) to the audio signal.

In Example 10, the subject matter of Example 9 optionally includes wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.

In Example 11, the subject matter of any one or more of Examples 1-10 optionally include wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies an automated gain control to the audio signal.

In Example 12, the subject matter of Example 11 optionally includes wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.

In Example 13, the subject matter of any one or more of Examples 1-12 optionally include a second filter to perform acoustic echo cancellation to the plurality of audio channels.

In Example 14, the subject matter of Example 13 optionally includes wherein the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams.

Example 15 is at least one machine readable medium including instructions for a pre-processing pipeline, the instructions, when executed by a machine, causing the machine to perform operations comprising: obtaining a plurality of audio channels; removing reverberations from the audio channels; partitioning the plurality of audio channels into beams after reverberations are removed; selecting a partition corresponding to a beam in the beams based on a noise level; filtering an audio signal from the selected partition; and providing the filtered audio signal to an external entity via an output interface of the pre-processing pipeline.

In Example 16, the subject matter of Example 15 optionally includes wherein the operations include canceling echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.

In Example 17, the subject matter of any one or more of Examples 15-16 optionally include wherein the partitioning the plurality of audio channels into beams includes: receiving the plurality of audio channels at a beam-former processor; partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and providing each partition to a phase-based beam-former.

In Example 18, the subject matter of any one or more of Examples 15-17 optionally include wherein the selecting the partition corresponding to the beam based on the noise level includes comparing speech levels between the beams and selecting the beam based on having the highest speech levels determined from the comparison.

In Example 19, the subject matter of any one or more of Examples 15-18 optionally include wherein the selecting the partition corresponding to the beam based on the noise level includes comparing noise levels between the beams and selecting the beam based on having the lowest noise levels determined from the comparison.

In Example 20, the subject matter of Example 19 optionally includes wherein a phrase quality scorer of a stream selector performing the partition selection compares the noise levels between the beams.

In Example 21, the subject matter of Example 20 optionally includes wherein a signal-to-noise (SNR) meter of the stream selector provides a noise level for each beam.

In Example 22, the subject matter of any one or more of Examples 15-21 optionally include wherein the filtering includes applying noise reduction to the audio signal.

In Example 23, the subject matter of any one or more of Examples 15-22 optionally include wherein the filtering includes applying a spectral profile matching (SPM) to the audio signal.

In Example 24, the subject matter of Example 23 optionally includes wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.

In Example 25, the subject matter of any one or more of Examples 15-24 optionally include wherein the filtering includes applying an automated gain control to the audio signal.

In Example 26, the subject matter of Example 25 optionally includes wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.

In Example 27, the subject matter of any one or more of Examples 15-26 optionally include wherein the operations comprise performing acoustic echo cancellation to the plurality of audio channels.

In Example 28, the subject matter of Example 27 optionally includes wherein the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams.

Example 29 is a method for automatic speech recognition pre-processing, the method comprising: obtaining a plurality of audio channels; removing reverberations from the audio channels; partitioning the plurality of audio channels into beams after the reverberations are removed; selecting a partition corresponding to a beam in the beams based on a noise level; filtering an audio signal from the selected partition; and providing the filtered audio signal to an external entity via an output interface of the pre-processing pipeline.

In Example 30, the subject matter of Example 29 optionally includes canceling echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.

In Example 31, the subject matter of any one or more of Examples 29-30 optionally include wherein partitioning the plurality of audio channels into beams includes: receiving the plurality of audio channels at a beam-former processor; partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and providing each partition to a phase-based beam-former.

In Example 32, the subject matter of any one or more of Examples 29-31 optionally include wherein selecting the partition corresponding to the beam based on the noise level includes comparing speech levels between the beams and selecting the beam based on having the highest speech levels determined from the comparison.

In Example 33, the subject matter of any one or more of Examples 29-32 optionally include wherein selecting the partition corresponding to the beam based on the noise level includes comparing noise levels between the beams and selecting the beam based on having the lowest noise levels determined from the comparison.

In Example 34, the subject matter of Example 33 optionally includes wherein a phrase quality scorer of a stream selector performing the partition selection compares the noise levels between the beams.

In Example 35, the subject matter of Example 34 optionally includes wherein a signal-to-noise (SNR) meter of the stream selector provides a noise level for each beam.

In Example 36, the subject matter of any one or more of Examples 29-35 optionally include wherein the filtering includes applying noise reduction to the audio signal.

In Example 37, the subject matter of any one or more of Examples 29-36 optionally include wherein the filtering includes applying a spectral profile matching (SPM) to the audio signal.

In Example 38, the subject matter of Example 37 optionally includes wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.

In Example 39, the subject matter of any one or more of Examples 29-38 optionally include wherein the filtering includes applying an automated gain control to the audio signal.

In Example 40, the subject matter of Example 39 optionally includes wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.

In Example 41, the subject matter of any one or more of Examples 29-40 optionally include performing acoustic echo cancellation to the plurality of audio channels.

In Example 42, the subject matter of Example 41 optionally includes wherein the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams.

Example 43 is a system comprising means to perform any of the methods 29-42.

Example 44 is at least one machine readable medium including instructions that, when executed by a machine, cause the machine to perform any of the methods 29-42.

Example 45 is a system for automatic speech recognition pre-processing, the system comprising: means for obtaining a plurality of audio channels; means for removing reverberations from the plurality of audio channels; means for partitioning the plurality of audio channels into beams after the reverberations are removed; means for selecting a partition corresponding to a beam in the beams based on a noise level; means for filtering an audio signal from the selected partition; and means for providing the filtered audio signal to an external entity via an output interface of the pre-processing pipeline.

In Example 46, the subject matter of Example 45 optionally includes means for canceling echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.

In Example 47, the subject matter of any one or more of Examples 45-46 optionally include wherein the means for partitioning the plurality of audio channels into beams includes: means for receiving the plurality of audio channels at a beam-former processor; means for partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and means for providing each partition to a phase-based beam-former.

In Example 48, the subject matter of any one or more of Examples 45-47 optionally include wherein the means for selecting the partition corresponding to the beam based on the noise level includes means for comparing speech levels between the beams and selecting the beam based on having the highest speech levels determined from the comparison.

In Example 49, the subject matter of any one or more of Examples 45-48 optionally include wherein the means for selecting the partition corresponding to the beam based on the noise level includes means for comparing noise levels between the beams and selecting the beam based on having the lowest noise levels determined from the comparison.

In Example 50, the subject matter of Example 49 optionally includes wherein a phrase quality scorer of a stream selector performing the partition selection compares the noise levels between the beams.

In Example 51, the subject matter of Example 50 optionally includes wherein a signal-to-noise (SNR) meter of the stream selector provides a noise level for each beam.

In Example 52, the subject matter of any one or more of Examples 45-51 optionally include wherein the means for filtering includes means for applying noise reduction to the audio signal.

In Example 53, the subject matter of any one or more of Examples 45-52 optionally include wherein the means for filtering includes means for applying a spectral profile matching (SPM) to the audio signal.

In Example 54, the subject matter of Example 53 optionally includes wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.

In Example 55, the subject matter of any one or more of Examples 45-54 optionally include wherein the means for filtering includes means for applying an automated gain control to the audio signal.

In Example 56, the subject matter of Example 55 optionally includes wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.

In Example 57, the subject matter of any one or more of Examples 45-56 optionally include means for performing acoustic echo cancellation to the plurality of audio channels.

In Example 58, the subject matter of Example 57 optionally includes wherein the acoustic echo cancellation is performed prior to partitioning the plurality of audio channels into beams.

The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show, by way of illustration, specific embodiments that may be practiced. These embodiments are also referred to herein as “examples.” Such examples may include elements in addition to those shown or described. However, the present inventors also contemplate examples in which only those elements shown or described are provided. Moreover, the present inventors also contemplate examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof), or with respect to other examples (or one or more aspects thereof) shown or described herein.

All publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated reference(s) should be considered supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.

In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended, that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim are still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements on their objects.

The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. The scope of the embodiments should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
1. A system for automatic speech recognition pre-processing, the system comprising: a sampler to obtain a plurality of audio channels; a de-reverberator to remove reverberations from the plurality of audio channels; a beam-former processor to partition the plurality of audio channels into beams after reverberations are removed; a stream selector to select a partition corresponding to a beam in the beams based on a noise level; a filter to reduce a noise level in a speech signal from the selected partition; and a controller to provide the audio signal to an external entity via an output interface of the pre-processing pipeline.
2. The system of claim 1, comprising an echo cancelation block disposed between the de-reverberator and the beam-former processor to cancel echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.
3. The system of claim 1, wherein, to partition the plurality of audio channels into beams, the beam-former processor is to: receive the plurality of audio channels; partition the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and provide each partition to a phase-based beam-former.
4. The system of claim 1, wherein, to select the partition corresponding to the beam based on the noise level, the stream selector is to: compare noise levels between the beams; and select the beam based on having the lowest noise levels determined from the comparison.
5. The system of claim 1, wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies noise reduction to the audio signal.
6. The system of claim 1, wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies a spectral profile matching (SPM) to the audio signal.
7. The system of claim 6, wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.
8. The system of claim 1, wherein, to reduce the noise level in the speech signal from the selected partition, the filter applies an automated gain control to the audio signal.
9. The system of claim 8, wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.
10. At least one machine readable medium including instructions for a pre-processing pipeline, the instructions, when executed by a machine, causing the machine to perform operations comprising: obtaining a plurality of audio channels; removing reverberations from the audio channels; partitioning the plurality of audio channels into beams after reverberations are removed; selecting a partition corresponding to a beam in the beams based on a noise level; filtering an audio signal from the selected partition; and providing the filtered audio signal to an external entity via an output interface of the pre-processing pipeline.
11. The at least one machine readable medium of claim 10, wherein the operations include canceling echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.
12. The at least one machine readable medium of claim 10, wherein the partitioning the plurality of audio channels into beams includes: receiving the plurality of audio channels at a beam-former processor; partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and providing each partition to a phase-based beam-former.
13. The at least one machine readable medium of claim 10, wherein the selecting the partition corresponding to the beam based on the noise level includes comparing noise levels between the beams and selecting the beam based on having the lowest noise levels determined from the comparison.
14. The at least one machine readable medium of claim 10, wherein the filtering includes applying noise reduction to the audio signal.
15. The at least one machine readable medium of claim 10, wherein the filtering includes applying a spectral profile matching (SPM) to the audio signal.
16. The at least one machine readable medium of claim 15, wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.
17. The at least one machine readable medium of claim 10, wherein the filtering includes applying an automated gain control to the audio signal.
18. The at least one machine readable medium of claim 17, wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.
19. A method for automatic speech recognition pre-processing, the method comprising: obtaining a plurality of audio channels; removing reverberations from the audio channels; partitioning the plurality of audio channels into beams after the reverberations are removed; selecting a partition corresponding to a beam in the beams based on a noise level; filtering an audio signal from the selected partition; and providing the filtered audio signal to an external entity via an output interface of the pre-processing pipeline.
20. The method of claim 19, comprising canceling echoes from the plurality of audio channels after the reverberations are removed and before the plurality of audio channels are partitioned into beams.
21. The method of claim 19, wherein partitioning the plurality of audio channels into beams includes: receiving the plurality of audio channels at a beam-former processor; partitioning the plurality of audio channels into partitions of two audio channels based on a relationship between microphones producing the plurality of audio channels; and providing each partition to a phase-based beam-former.
22. The method of claim 19, wherein the filtering includes applying a spectral profile matching (SPM) to the audio signal.
23. The method of claim 22, wherein the spectral profile matching is applied after noise reduction is applied to the audio signal.
24. The method of claim 19, wherein the filtering includes applying an automated gain control to the audio signal.
25. The method of claim 24, wherein the automated gain control is applied after a spectral profile matching is applied to the audio signal.